Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

758 investigate extracts to identify areas of code which can be cut down for processing times #899

Conversation

lizihao-anu
Copy link
Contributor

As discussed in #758, two functions, process_sc_all_sds and process_sc_all_alarms_telecare, take a very long time to execute. I have re-written the functions using data.table to improve the efficiency, including the speed and memory usage. The improvement is ridiculously significant.

Here's a summary comparing the improvements in two functions, process_sc_all_sds and process_sc_all_alarms_telecare:

Function Name Old Method New Method Improvement
process_sc_all_sds 24.94 mins 1.69 mins 94% reduction
process_sc_all_alarms_telecare 31.81 mins 0.81 mins 98% reduction

In both cases, the new methods produce identical results to the old methods but demonstrate significant reductions in running time. The process_sc_all_sds function saw a 94% reduction in running time, while the process_sc_all_alarms_telecare function experienced an even more impressive 98% reduction. These improvements are a testament to the effectiveness of the rewritten functions in optimizing their performance.

@lizihao-anu lizihao-anu changed the base branch from master to mar-23-update January 22, 2024 18:21

This comment has been minimized.

lizihao-anu and others added 2 commits January 22, 2024 18:26
…y-areas-of-code-which-can-be-cut-down-for-processing-times

This comment has been minimized.

…y-areas-of-code-which-can-be-cut-down-for-processing-times

This comment has been minimized.

Copy link
Collaborator

@SwiftySalmon SwiftySalmon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the sds code and it's much much faster.

Only issue is that I've made a lot of changes to the sds code! Might need to merge my pull request and then make some changes to this one?

@lizihao-anu
Copy link
Contributor Author

Ran the sds code and it's much much faster.

Only issue is that I've made a lot of changes to the sds code! Might need to merge my pull request and then make some changes to this one?

Yes. I agree. We can merge your changes to sds first, and the I can rewrite the code for sds.

@Jennit07
Copy link
Collaborator

Jennit07 commented Feb 7, 2024

Hi @lizihao-anu PR #900 has now merged. When you are back from A/L would you be able to have a look at the SDS code again and make any changes to have this in data table code please? Thanks!

@Jennit07 Jennit07 added On hold Waiting for something / someone outside of our control Priority: High Mar23-update labels Feb 7, 2024
@lizihao-anu lizihao-anu requested a review from Jennit07 February 28, 2024 10:42

This comment has been minimized.

This comment has been minimized.

…y-areas-of-code-which-can-be-cut-down-for-processing-times

This comment has been minimized.

Copy link

@check-spelling-bot Report

🔴 Please review

See the 📂 files view, the 📜action log, or 📝 job summary for details.

Unrecognized words (3)

fcase
fifelse
setorder

To accept these unrecognized words as correct, you could run the following commands

... in a clone of the [email protected]:Public-Health-Scotland/source-linkage-files.git repository
on the 758-investigate-extracts-to-identify-areas-of-code-which-can-be-cut-down-for-processing-times branch (ℹ️ how do I use this?):

curl -s -S -L 'https://raw.githubusercontent.com/check-spelling/check-spelling/main/apply.pl' |
perl - 'https://github.com/Public-Health-Scotland/source-linkage-files/actions/runs/8079543041/attempts/1'

OR

To have the bot accept them for you, reply quoting the following line:
@check-spelling-bot apply updates.

Errors (2)

See the 📂 files view, the 📜action log, or 📝 job summary for details.

❌ Errors Count
❌ ignored-expect-variant 13
ℹ️ no-newline-at-eof 1

See ❌ Event descriptions for more information.

If the flagged items are 🤯 false positives

If items relate to a ...

  • binary file (or some other file you wouldn't want to check at all).

    Please add a file path to the excludes.txt file matching the containing file.

    File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.

    ^ refers to the file's path from the root of the repository, so ^README\.md$ would exclude README.md (on whichever branch you're using).

  • well-formed pattern.

    If you can write a pattern that would match it,
    try adding it to the patterns.txt file.

    Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.

    Note that patterns can't match multiline strings.

@lizihao-anu
Copy link
Contributor Author

Ready to merge

@Jennit07 Jennit07 merged commit 6dab611 into mar-23-update Feb 28, 2024
11 of 12 checks passed
@Jennit07 Jennit07 deleted the 758-investigate-extracts-to-identify-areas-of-code-which-can-be-cut-down-for-processing-times branch February 28, 2024 12:46
Jennit07 added a commit that referenced this pull request Mar 26, 2024
* Remove redundant code

* Update documentation

* Style code

* Reorder when we match on client variables
This was causing NSUs to show a social care id. This now resolves this.

* Update documentation

* Style code

* Revert "Update logic to use end of Quarter"

This reverts commit 004e831.

* Style code

* Update documentation

* add check comment (TO DO for this PR)

* Remove `check_quarter_format` function

* Remove `check_quarter_format`

* Add chi parameter to `create_demog_test_flags`

* Style code

* Use CHI parameter for ep/indiv tests

* Use CHI parameter for extract tests (chi)

* Change test sheet names to lowercase

* Change date to lowercase

* Update documentation

* Update documentation

* Update documentation

* Style code

* Fix pick variables
This was not taking the correct variables, leading to NSUs being assigned psychiatry

* SC Demographics and SDS (#900)

* Style code

* # read in sc demographics

different variables - removed extract date as not accurate, using chi over upi after discussion with social care data management. Added in date of death just for fun.

* social care demographics first draft

removed a lot of the submitted variables and instead using chi variables from chi seeding. Other changes:
- Fill in missing values,
- create flag for latest social care id (one from database is not accurate), this makes sure that each chi only has ONE sc id as the latest to stop it creating duplicates
- change postcode to choose chi over submitted

* Style code

* had a github error? Not sure what happened but commiting first draft of sc demographics

* Style code

* first draft sds.
No major changes - only how demographics is matched on and how latest social care id is selected

* Update documentation

* demographics - add sending location to group by

* Style code

* Update documentation

* Added ungroup()

* Remove comments

* Remove comments

* Style code

---------

Co-authored-by: SwiftySalmon <[email protected]>
Co-authored-by: marjom02 <[email protected]>
Co-authored-by: Jennit07 <[email protected]>
Co-authored-by: Jennit07 <[email protected]>
Co-authored-by: Zihao Li <[email protected]>

* Sc all at speedup (#904)

* speed up process_sc_all_alarms_telecare function with data.table package

* Update documentation

---------

Co-authored-by: lizihao-anu <[email protected]>
Co-authored-by: Megan McNicol <[email protected]>
Co-authored-by: Jennit07 <[email protected]>

* Add case_when statement for `high_cc` cohort

* Bug - `high_cc` in demographic cohort showing `NAs` instead of `TRUE/FALSE` (#911)

Add case_when statement for `high_cc` cohort

* added a casewhen to update property type description for homelessness

* Update documentation

* Style code

* Bug - deal with missing variables (#914)

* Add missing sc variables for no sc data

* Fix code for including `_inc_dna` variables

* Remove commented line

* Bug - Fix get pop path failing and preventing the indiv file from running.  (#913)

Fix bug - pop file paths breaking indiv file

* correct file hscp file path

* Update process_sc_all_home_care.R

A small issue was identified when running targets. Linked with changes to the function `fix_sc_end_dates()`

* Update process_sc_all_alarms_telecare.R

* remove duplicate columns

* Fix targets (#892)

* fix sc_client_lookup sc_send_lca

* fix an issue of get_pop_path

* Style code

* fix the rest of get_pop_path from get_datazone_pop_path

* Update documentation

* fix sc_send_lca

* add missing year column

* explicitly specify the argument year to avoid corruption of targets

* Update documentation

* new data pipeline with targets
remove create_individual_files from targets and append it to run_targets script

* minor changes

* Style code

* undo sc_send_lca bit

* Update targets scripts

* Remove top level targets scripts

---------

Co-authored-by: lizihao-anu <[email protected]>
Co-authored-by: Megan McNicol <[email protected]>
Co-authored-by: Jennit07 <[email protected]>
Co-authored-by: Jennifer Thom <[email protected]>

* remove cases that start date is later than end date

* Update Refs for March24 SLF update

* 758 investigate extracts to identify areas of code which can be cut down for processing times (#899)

* re-writing process_sc_all sds and alarm_telecare with data.table to improve the speed

* Update documentation

* Style code

* changes in line with new process_sc_all_sds dplyr version

* Style code

* remove duplicate columns

* remove duplicated columns

---------

Co-authored-by: lizihao-anu <[email protected]>
Co-authored-by: Megan McNicol <[email protected]>

* Update homelessness completeness path

* Update check_year_valid function

* 920 issues with file permissions need constant monitoring (#921)

* set a correct file permission

* update descriptions in process_tests function

* Update documentation

---------

Co-authored-by: lizihao-anu <[email protected]>

* change joining with sc_demog_lookup to right_join and move person_id down

* Archive social care extracts (#927)

* Set up `get_sandpit_extract_path`

* Update documentation

* Update sc `all` data paths

* Write sandpit extract if file does not exist

* Style code

---------

Co-authored-by: Jennit07 <[email protected]>

* Update excel sg completeness tabs

---------

Co-authored-by: Jennit07 <[email protected]>
Co-authored-by: Megan McNicol <[email protected]>
Co-authored-by: SwiftySalmon <[email protected]>
Co-authored-by: marjom02 <[email protected]>
Co-authored-by: Zihao Li <[email protected]>
Co-authored-by: lizihao-anu <[email protected]>
Co-authored-by: rchlv <[email protected]>
Co-authored-by: Zihao Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
On hold Waiting for something / someone outside of our control Priority: High
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate extracts to identify areas of code which can be cut down for processing times
3 participants